From the course website:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv("https://www.mth548.org/_static/kde_marathon_results/marathon_results.csv")
df["tot_minutes"] = pd.to_timedelta(df["Finish"]).dt.total_seconds()/60
df
| Age | M/F | Country | 5K | 10K | 15K | 20K | Half | 25K | 30K | 35K | 40K | Finish | Pace | Overall | Gender | Division | tot_minutes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | M | ETH | 00:14:43 | 00:29:43 | 00:44:57 | 01:00:29 | 01:04:02 | 01:16:07 | 01:32:00 | 01:47:59 | 02:02:39 | 02:09:17 | 00:04:56 | 1 | 1 | 1 | 129.283333 |
| 1 | 30 | M | ETH | 00:14:43 | 00:29:43 | 00:44:58 | 01:00:28 | 01:04:01 | 01:16:07 | 01:31:59 | 01:47:59 | 02:02:42 | 02:09:48 | 00:04:58 | 2 | 2 | 2 | 129.800000 |
| 2 | 29 | M | KEN | 00:14:43 | 00:29:43 | 00:44:57 | 01:00:29 | 01:04:02 | 01:16:07 | 01:32:00 | 01:47:59 | 02:03:01 | 02:10:22 | 00:04:59 | 3 | 3 | 3 | 130.366667 |
| 3 | 28 | M | KEN | 00:14:43 | 00:29:44 | 00:45:01 | 01:00:29 | 01:04:02 | 01:16:07 | 01:32:00 | 01:48:03 | 02:03:47 | 02:10:47 | 00:05:00 | 4 | 4 | 4 | 130.783333 |
| 4 | 32 | M | KEN | 00:14:43 | 00:29:44 | 00:44:58 | 01:00:28 | 01:04:01 | 01:16:07 | 01:32:00 | 01:47:59 | 02:03:27 | 02:10:49 | 00:05:00 | 5 | 5 | 5 | 130.816667 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26293 | 64 | F | USA | 00:50:15 | 01:43:31 | 02:36:53 | 03:32:26 | 03:43:46 | 04:25:53 | 05:19:44 | 06:17:19 | 07:13:34 | 07:38:56 | 00:17:31 | 26594 | 12015 | 269 | 458.933333 |
| 26294 | 61 | F | USA | 00:48:36 | 01:39:39 | 02:39:13 | 03:35:58 | 03:47:55 | 04:32:44 | 05:31:58 | 06:28:56 | 07:26:19 | 07:51:30 | 00:17:59 | 26595 | 12016 | 270 | 471.500000 |
| 26295 | 66 | F | USA | 00:53:03 | 01:47:16 | 02:41:45 | 03:37:07 | 03:48:21 | 04:33:51 | 05:38:56 | 06:38:51 | 07:36:18 | 07:59:33 | 00:18:18 | 26596 | 12017 | 91 | 479.550000 |
| 26296 | 53 | M | USA | 00:49:04 | 01:40:12 | 02:33:31 | 03:31:41 | 03:43:35 | 04:29:20 | 05:31:11 | 06:33:35 | 07:35:38 | 08:00:37 | 00:18:20 | 26597 | 14580 | 2055 | 480.616667 |
| 26297 | 62 | M | USA | 00:40:14 | 01:28:18 | 02:26:46 | 03:28:41 | 03:40:36 | 04:36:06 | 05:43:44 | 06:51:31 | 07:41:28 | 08:06:01 | 00:18:33 | 26598 | 14581 | 898 | 486.016667 |
26298 rows × 18 columns
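The conversion used for the tot_minutes column can be checked on a couple of finish times taken from the table (a small sketch, independent of the full dataset):

```python
import pandas as pd

# pd.to_timedelta parses "HH:MM:SS" strings into Timedelta objects;
# total_seconds()/60 converts them to minutes
t = pd.to_timedelta(pd.Series(["02:09:17", "08:06:01"]))
minutes = t.dt.total_seconds() / 60
print(minutes.round(2).tolist())  # [129.28, 486.02], matching the table above
```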
from scipy.stats import gaussian_kde
kde = gaussian_kde(df["tot_minutes"])
plt.style.use('bmh')
plt.figure(figsize=(10, 5))
x = np.linspace(120, 500, 400)
plt.plot(x, kde(x));
kde.factor
0.13062173733308516
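The default bandwidth factor is given by Scott's rule, $n^{-1/(d+4)}$, where $n$ is the number of data points and $d$ the dimension of the data; for this dataset ($n = 26298$, $d = 1$) it reproduces the value above:

```python
import numpy as np

n, d = 26298, 1  # number of marathon records and the data dimension
scott_factor = n ** (-1.0 / (d + 4))
print(scott_factor)  # ≈ 0.1306, the kde.factor value above
```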
kde2 = gaussian_kde(df["tot_minutes"], bw_method=0.6)
plt.figure(figsize=(10, 5))
x = np.linspace(120, 500, 400)
plt.plot(x, kde2(x));
Split data by sex, and compute KDE for each group:
dfm = df[df["M/F"] == "M"]
dff = df[df["M/F"] == "F"]
kdem = gaussian_kde(dfm["tot_minutes"])
kdef = gaussian_kde(dff["tot_minutes"])
Plot density functions:
x = np.linspace(120,500, 400)
plt.figure(figsize=(10,5))
plt.plot(x, kdem(x), label="M")
plt.plot(x, kdef(x), label="F")
plt.legend()
plt.show()
Goal: Use KDE to predict the sex of a runner based on their finish time.
$$ P(\text{sex} = F \mid \text{time} = t_0) = \dfrac{P(\text{time} = t_0 \mid \text{sex} = F)\,P(\text{sex} = F)}{P(\text{time} = t_0)} $$
Also, by the law of total probability:
$$ P(\text{time} = t_0) = P(\text{time} = t_0 \mid \text{sex} = F)\,P(\text{sex} = F) + P(\text{time} = t_0 \mid \text{sex} = M)\,P(\text{sex} = M) $$
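A quick numeric illustration of these formulas, with made-up density and prior values (not taken from the marathon data):

```python
# Hypothetical values: class-conditional densities at some time t0, and priors
p_t_given_f = 0.008      # P(time = t0 | sex = F), e.g. a KDE value
p_t_given_m = 0.005      # P(time = t0 | sex = M)
p_f, p_m = 0.45, 0.55    # P(sex = F), P(sex = M)

# law of total probability, then Bayes' formula
p_t = p_t_given_f * p_f + p_t_given_m * p_m
posterior_f = p_t_given_f * p_f / p_t
print(round(posterior_f, 4))  # 0.5669
```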
Bayes classifier using KDE (from the course website).
Split marathon data into training and test parts:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.5, random_state=123)
train_df
| Age | M/F | Country | 5K | 10K | 15K | 20K | Half | 25K | 30K | 35K | 40K | Finish | Pace | Overall | Gender | Division | tot_minutes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1024 | 33 | M | USA | 00:18:33 | 00:37:18 | 00:56:26 | 01:15:51 | 01:20:12 | 01:35:51 | 01:56:38 | 02:18:35 | 02:40:36 | 02:50:47 | 00:06:31 | 1026 | 985 | 825 | 170.783333 |
| 14892 | 40 | F | USA | 00:26:44 | 00:53:34 | 01:20:44 | 01:47:42 | 01:53:30 | 02:14:47 | 02:42:09 | 03:09:04 | 03:34:16 | 03:45:06 | 00:08:36 | 14970 | 5196 | 914 | 225.100000 |
| 13216 | 55 | M | USA | 00:25:14 | 00:49:51 | 01:14:52 | 01:41:52 | 01:47:19 | 02:07:11 | 02:33:40 | 03:01:24 | 03:27:57 | 03:39:35 | 00:08:23 | 13277 | 9100 | 653 | 219.583333 |
| 10302 | 37 | M | USA | 00:26:17 | 00:50:37 | 01:15:09 | 01:40:43 | 01:46:17 | 02:07:12 | 02:33:12 | 02:57:48 | 03:21:09 | 03:30:35 | 00:08:02 | 10337 | 7833 | 3560 | 210.583333 |
| 13164 | 57 | M | USA | 00:25:18 | 00:49:50 | 01:14:45 | 01:39:49 | 01:45:22 | 02:05:06 | 02:31:41 | 02:59:42 | 03:27:39 | 03:39:26 | 00:08:23 | 13224 | 9080 | 647 | 219.433333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15377 | 26 | F | USA | 00:25:18 | 00:51:06 | 01:16:33 | 01:41:52 | 01:47:26 | 02:08:08 | 02:35:55 | 03:05:23 | 03:34:32 | 03:46:50 | 00:08:40 | 15463 | 5505 | 3578 | 226.833333 |
| 21602 | 60 | F | CAN | 00:31:03 | 01:01:33 | 01:30:58 | 02:00:14 | 02:06:34 | 02:30:29 | 03:02:19 | 03:33:13 | 04:05:01 | 04:19:26 | 00:09:54 | 21787 | 9327 | 114 | 259.433333 |
| 17730 | 43 | F | USA | 00:27:06 | 00:53:30 | 01:20:51 | 01:48:06 | 01:54:00 | 02:16:06 | 02:44:59 | 03:14:54 | 03:42:59 | 03:55:26 | 00:08:59 | 17852 | 6994 | 1297 | 235.433333 |
| 15725 | 47 | F | USA | 00:27:22 | 00:53:47 | 01:19:38 | 01:47:14 | 01:53:03 | 02:13:46 | 02:41:02 | 03:08:51 | 03:36:21 | 03:48:11 | 00:08:43 | 15818 | 5737 | 649 | 228.183333 |
| 19966 | 37 | F | USA | 00:28:36 | 00:55:46 | 01:24:31 | 01:52:27 | 01:58:58 | 02:23:39 | 02:55:00 | 03:25:49 | 03:54:46 | 04:07:24 | 00:09:27 | 20116 | 8325 | 4424 | 247.400000 |
13149 rows × 18 columns
test_df
| Age | M/F | Country | 5K | 10K | 15K | 20K | Half | 25K | 30K | 35K | 40K | Finish | Pace | Overall | Gender | Division | tot_minutes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15955 | 47 | F | USA | 00:27:31 | 00:53:43 | 01:19:39 | 01:45:57 | 01:51:39 | 02:12:27 | 02:40:08 | 03:08:26 | 03:36:40 | 03:48:56 | 00:08:44 | 16054 | 5886 | 682 | 228.933333 |
| 1083 | 37 | M | NED | 00:18:30 | 00:37:13 | 00:56:19 | 01:15:54 | 01:20:10 | 01:35:53 | 01:57:06 | 02:20:18 | 02:42:10 | 02:51:28 | 00:06:33 | 1085 | 1039 | 870 | 171.466667 |
| 358 | 40 | M | CAN | 00:18:31 | 00:37:16 | 00:56:05 | 01:15:04 | 01:19:12 | 01:34:06 | 01:53:31 | 02:13:14 | 02:33:04 | 02:42:01 | 00:06:11 | 359 | 343 | 21 | 162.016667 |
| 5136 | 34 | F | USA | 00:20:19 | 00:41:00 | 01:01:57 | 01:23:14 | 01:27:55 | 01:44:44 | 02:07:58 | 02:34:10 | 03:00:20 | 03:12:22 | 00:07:21 | 5149 | 515 | 449 | 192.366667 |
| 16029 | 40 | F | CAN | 00:25:02 | 00:51:08 | 01:19:07 | 01:45:56 | 01:51:43 | 02:13:12 | 02:41:19 | 03:11:08 | 03:37:51 | 03:49:11 | 00:08:45 | 16128 | 5935 | 1083 | 229.183333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19657 | 48 | F | USA | 00:27:53 | 00:58:05 | 01:26:21 | 01:54:59 | 02:01:09 | 02:23:39 | 02:52:50 | 03:23:20 | 03:52:51 | 04:05:29 | 00:09:22 | 19804 | 8137 | 1247 | 245.483333 |
| 10744 | 42 | F | GBR | 00:25:47 | 00:50:38 | 01:15:34 | 01:40:12 | 01:45:37 | 02:05:56 | 02:30:53 | 02:56:18 | 03:21:03 | 03:31:57 | 00:08:06 | 10781 | 2736 | 349 | 211.950000 |
| 17175 | 41 | M | USA | 00:24:40 | 00:53:59 | 01:21:15 | 01:48:49 | 01:54:28 | 02:16:23 | 02:44:47 | 03:13:03 | 03:40:48 | 03:53:21 | 00:08:55 | 17291 | 10665 | 1671 | 233.350000 |
| 3806 | 27 | M | USA | 00:21:44 | 00:43:56 | 01:06:09 | 01:28:25 | 01:33:19 | 01:50:49 | 02:13:21 | 02:35:36 | 02:56:51 | 03:06:07 | 00:07:06 | 3814 | 3546 | 2458 | 186.116667 |
| 10433 | 24 | F | USA | 00:23:43 | 00:46:45 | 01:09:57 | 01:33:42 | 01:38:51 | 01:57:49 | 02:24:02 | 02:51:45 | 03:18:56 | 03:30:59 | 00:08:03 | 10469 | 2570 | 2027 | 210.983333 |
13149 rows × 18 columns
Next, we construct a predictor using training data.
train_df_grouped = train_df.groupby("M/F")
train_dfm = train_df_grouped.get_group("M")
train_dff = train_df_grouped.get_group("F")
# kde for males
kdem = gaussian_kde(train_dfm["tot_minutes"])
# kde for females
kdef = gaussian_kde(train_dff["tot_minutes"])
# number of males
mc = train_dfm["tot_minutes"].count()
#number of females
fc = train_dff["tot_minutes"].count()
# probability runner is a female
prob_f = fc/(fc+mc)
# probability runner is a male
prob_m = mc/(fc+mc)
def predictor(t):
    # class-conditional densities of the finish time t
    ptf = kdef(t)
    ptm = kdem(t)
    # Bayes' formula: posterior probability that the runner is female
    return (ptf*prob_f)/(ptf*prob_f + ptm*prob_m)
predictor(240)
array([0.59711338])
Plot finish times vs predicted values:
t = np.linspace(120,360, 400)
plt.figure(figsize=(10,5))
plt.plot(t, predictor(t))
plt.axhline(0.5, ls ='--', c='r');
plt.axvline(210, ls ='--', c='r');
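The vertical dashed line suggests that the decision boundary sits near 210 minutes. The crossing point of the posterior with 0.5 can be found with a root finder; a sketch on synthetic data (two normal samples standing in for the male and female finish times, since this snippet does not assume the marathon file):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
tm_s = rng.normal(210, 40, 2000)   # synthetic "male" finish times
tf_s = rng.normal(250, 40, 2000)   # synthetic "female" finish times
kde_m, kde_f = gaussian_kde(tm_s), gaussian_kde(tf_s)
pm = pf = 0.5                      # equal priors for this illustration

def posterior_f(t):
    return (kde_f(t) * pf) / (kde_f(t) * pf + kde_m(t) * pm)

# the decision boundary is where the posterior crosses 0.5
boundary = brentq(lambda t: posterior_f(t).item() - 0.5, 150, 300)
print(round(boundary))  # near 230, the midpoint of the two means
```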
Add a column with predictions to the test DataFrame:
test_df["prob_F"] = predictor(test_df["tot_minutes"])
test_df.sample(5)
| Age | M/F | Country | 5K | 10K | 15K | 20K | Half | 25K | 30K | 35K | 40K | Finish | Pace | Overall | Gender | Division | tot_minutes | prob_F | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13213 | 34 | M | AUS | 00:22:23 | 00:45:14 | 01:08:38 | 01:33:01 | 01:38:27 | 01:58:44 | 02:26:19 | 02:54:58 | 03:25:40 | 03:39:35 | 00:08:23 | 13274 | 9098 | 3748 | 219.583333 | 0.583189 |
| 19021 | 67 | M | USA | 00:28:07 | 00:55:50 | 01:23:36 | 01:52:21 | 01:58:27 | 02:21:04 | 02:50:56 | 03:20:46 | 03:49:11 | 04:01:31 | 00:09:13 | 19158 | 11421 | 120 | 241.516667 | 0.596974 |
| 22174 | 29 | F | USA | 00:27:25 | 00:55:21 | 01:23:52 | 01:53:17 | 01:59:49 | 02:23:36 | 02:56:17 | 03:30:26 | 04:07:54 | 04:24:12 | 00:10:05 | 22369 | 9647 | 4850 | 264.200000 | 0.555239 |
| 17359 | 48 | F | USA | 00:27:02 | 00:54:17 | 01:21:05 | 01:48:28 | 01:54:33 | 02:16:05 | 02:44:05 | 03:12:58 | 03:41:33 | 03:54:02 | 00:08:56 | 17476 | 6754 | 909 | 234.033333 | 0.603671 |
| 13543 | 47 | M | ITA | 00:23:30 | 00:47:15 | 01:11:00 | 01:34:59 | 01:40:13 | 01:58:51 | 02:23:25 | 02:50:14 | 03:24:00 | 03:40:33 | 00:08:25 | 13607 | 9234 | 1770 | 220.550000 | 0.587547 |
Check prediction accuracy:
sum((test_df["prob_F"] > 0.5) == (test_df["M/F"] == "F"))/len(test_df)
0.6548026465890943
We can predict with high confidence that some runners are men, but no such high-confidence prediction is possible for women:
test_df.query("prob_F < 0.3")["M/F"].value_counts()
M    2629
F     377
Name: M/F, dtype: int64
test_df.query("prob_F > 0.7")["M/F"].value_counts()
Series([], Name: M/F, dtype: int64)
As expected, high-confidence predictions can be made for very fast runners; 75% of them finished the marathon in under about 190 minutes, i.e. roughly 3 hours 10 minutes:
test_df.query("prob_F < 0.3")["tot_minutes"].describe()
count    3006.000000
mean      180.903942
std        12.310500
min       130.366667
25%       174.404167
50%       181.833333
75%       189.616667
max       417.833333
Name: tot_minutes, dtype: float64
For comparison, a trivial classifier that predicts "M" for every runner would have accuracy equal to the fraction of men in the test data:
len(test_df[test_df['M/F'] == 'M'])/len(test_df)
0.5468856947296372
Compare with prediction accuracy of k-NN:
from sklearn.neighbors import KNeighborsClassifier
k = 50
neigh = KNeighborsClassifier(n_neighbors=k)
neigh.fit(train_df["tot_minutes"].values.reshape(-1, 1), train_df["M/F"])
knn_pred = neigh.predict(test_df["tot_minutes"].values.reshape(-1, 1))
knn_pred
array(['F', 'M', 'M', ..., 'F', 'M', 'F'], dtype=object)
print((knn_pred == test_df['M/F']).sum()/len(test_df))
0.6453722716556393
print(neigh.score(test_df["tot_minutes"].values.reshape(-1, 1), test_df['M/F']))
0.6453722716556393
neigh.predict_proba([[200]])
array([[0.36, 0.64]])
neigh.classes_
array(['F', 'M'], dtype=object)
The plot below shows classification probabilities computed by k-NN with the number of neighbors chosen above (k = 50):
plt.figure(figsize=(10,5))
x = np.linspace(120,360, 400).reshape(-1, 1)
plt.plot(x, neigh.predict_proba(x)[:, 0])
plt.axhline(0.5, ls ='--', c='r');
plt.show()
Bandwidth selection: fit a KDE to the training data with a given bandwidth, then measure how well it explains the test data.
kde = gaussian_kde(train_df["tot_minutes"], bw_method=0.3)
plt.figure(figsize=(10, 5))
x = np.linspace(120, 500, 400)
plt.plot(x, kde(x));
The likelihood of the test data is a product of thousands of small density values, so it underflows to zero in floating point:
likelihood = kde(test_df["tot_minutes"]).prod()
likelihood
0.0
0.5**2000
0.0
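The underflow is not specific to KDE values: any product of thousands of probabilities collapses to zero in float64, while the sum of their logarithms stays representable:

```python
import numpy as np

p = np.full(2000, 0.5)    # 2000 probabilities of 0.5
print(p.prod())           # 0.0 -- underflows below float64's smallest positive value
print(np.log10(p).sum())  # about -602.06, no underflow
```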
log_likelihood = np.log10(kde(test_df["tot_minutes"])).sum()
log_likelihood
-28756.265601014704
choices = []
for bw in np.linspace(0.01, 0.3, 20):
kde = gaussian_kde(train_df["tot_minutes"], bw_method=bw)
log_likelihood = np.log10(kde(test_df["tot_minutes"])).sum()
print(f"bw={bw:.2f}, log_like={log_likelihood}")
choices.append((bw, log_likelihood))
bw=0.01, log_like=-28817.917595202038
bw=0.03, log_like=-28709.48265025039
bw=0.04, log_like=-28696.36462240536
bw=0.06, log_like=-28692.098901252437
bw=0.07, log_like=-28690.45313426285
bw=0.09, log_like=-28690.196250451263
bw=0.10, log_like=-28690.79905684394
bw=0.12, log_like=-28691.982108391796
bw=0.13, log_like=-28693.63706897139
bw=0.15, log_like=-28695.755302217007
bw=0.16, log_like=-28698.366808507242
bw=0.18, log_like=-28701.504011156496
bw=0.19, log_like=-28705.187108819886
bw=0.21, log_like=-28709.422653258458
bw=0.22, log_like=-28714.207903649116
bw=0.24, log_like=-28719.53606510508
bw=0.25, log_like=-28725.400168116994
bw=0.27, log_like=-28731.79516403717
bw=0.28, log_like=-28738.718637586047
bw=0.30, log_like=-28746.170695622182
choices
[(0.01, -28817.917595202038), (0.02526315789473684, -28709.48265025039), (0.04052631578947368, -28696.36462240536), (0.05578947368421053, -28692.098901252437), (0.07105263157894737, -28690.45313426285), (0.0863157894736842, -28690.196250451263), (0.10157894736842105, -28690.79905684394), (0.11684210526315789, -28691.982108391796), (0.13210526315789473, -28693.63706897139), (0.1473684210526316, -28695.755302217007), (0.16263157894736843, -28698.366808507242), (0.17789473684210527, -28701.504011156496), (0.1931578947368421, -28705.187108819886), (0.20842105263157895, -28709.422653258458), (0.2236842105263158, -28714.207903649116), (0.23894736842105263, -28719.53606510508), (0.25421052631578944, -28725.400168116994), (0.2694736842105263, -28731.79516403717), (0.2847368421052632, -28738.718637586047), (0.3, -28746.170695622182)]
m = max(choices, key=lambda x: x[1])
m
(0.0863157894736842, -28690.196250451263)
kde = gaussian_kde(train_df["tot_minutes"], bw_method=m[0])
plt.figure(figsize=(10, 5))
x = np.linspace(120, 500, 400)
plt.plot(x, kde(x));
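scikit-learn's KernelDensity supports a similar bandwidth search through cross-validation; its score is also a held-out log-likelihood. A sketch on synthetic data (note that KernelDensity takes an absolute bandwidth, not a factor relative to the data's standard deviation as gaussian_kde does):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
sample = rng.normal(0, 1, 500).reshape(-1, 1)  # synthetic 1-D data

# each candidate bandwidth is scored by the log-likelihood of held-out folds
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.linspace(0.05, 1.0, 20)},
                    cv=5)
grid.fit(sample)
print(grid.best_params_["bandwidth"])
```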
def f(x, y):
    # Himmelblau's function shifted down by 150
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2 - 150
x = np.linspace(0, 5, 6)
y = np.linspace(0, 5, 6)
X, Y = np.meshgrid(x, y)
X, Y
(array([[0., 1., 2., 3., 4., 5.],
[0., 1., 2., 3., 4., 5.],
[0., 1., 2., 3., 4., 5.],
[0., 1., 2., 3., 4., 5.],
[0., 1., 2., 3., 4., 5.],
[0., 1., 2., 3., 4., 5.]]),
array([[0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3., 3.],
[4., 4., 4., 4., 4., 4.],
[5., 5., 5., 5., 5., 5.]]))
plt.plot(X, Y, 'r.');
Z = f(X, Y)
Z
array([[ 20., -14., -76., -130., -116., 50.],
[ -14., -44., -98., -140., -110., 76.],
[ -60., -82., -124., -150., -100., 110.],
[ -82., -92., -118., -124., -50., 188.],
[ -20., -14., -20., -2., 100., 370.],
[ 210., 236., 254., 300., 434., 740.]])
import plotly.graph_objects as go
fig = go.Figure(go.Surface(x=X, y=Y, z=Z))
fig.show()
x = np.linspace(-5, 5, 400)
y = np.linspace(-5, 5, 400)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
fig = go.Figure(go.Surface(x=X, y=Y, z=Z, colorscale="Picnic"))
fig.show()
plt.figure(figsize=(6, 6))
plt.contour(X, Y, Z, levels=20, colors='k');
plt.figure(figsize=(6, 6))
plt.contourf(X, Y, Z, levels=20, cmap="jet");
plt.figure(figsize=(6, 6))
cp = plt.contourf(X, Y, Z, levels=20, cmap="seismic")
plt.contour(X, Y, Z, levels=20, colors='k')
plt.colorbar(cp);
from matplotlib.colors import TwoSlopeNorm
plt.figure(figsize=(6, 6))
divnorm = TwoSlopeNorm(vcenter=0)
cp = plt.contourf(X, Y, Z, levels=20, cmap="seismic", norm=divnorm)
plt.contour(X, Y, Z, levels=20, colors='k')
plt.colorbar(cp);
plt.figure(figsize=(10, 6))
plt.plot(df["tot_minutes"], df["Age"], 'r.', alpha=0.05)
Note. scipy.stats.gaussian_kde takes as its argument data organized in a 2-dimensional array in which each column gives the coordinates of one data point. For this reason, when we use it with data coming from a DataFrame, we need to transpose the data.
Notice that cross sections of the peaks of the KDE plot below are not circular, since scipy.stats.gaussian_kde uses the covariance matrix of the data to compute bandwidth in each direction.
data = np.array([[0, 0], [1, 0], [0, 1], [1,1]])
data
array([[0, 0],
[1, 0],
[0, 1],
[1, 1]])
data.T
array([[0, 1, 0, 1],
[0, 0, 1, 1]])
kde = gaussian_kde(data.T, bw_method=0.7)
kde([0.5, 0.7])
array([0.00918796])
x = np.linspace(-1, 2, 400)
y = np.linspace(-1, 2, 400)
X, Y = np.meshgrid(x, y)
Z = kde(np.array([X.flatten(), Y.flatten()])).reshape(X.shape)
Z
array([[0.00053435, 0.00055943, 0.00058549, ..., 0.00058549, 0.00055943,
0.00053435],
[0.00055943, 0.00058569, 0.00061296, ..., 0.00061296, 0.00058569,
0.00055943],
[0.00058549, 0.00061296, 0.00064151, ..., 0.00064151, 0.00061296,
0.00058549],
...,
[0.00058549, 0.00061296, 0.00064151, ..., 0.00064151, 0.00061296,
0.00058549],
[0.00055943, 0.00058569, 0.00061296, ..., 0.00061296, 0.00058569,
0.00055943],
[0.00053435, 0.00055943, 0.00058549, ..., 0.00058549, 0.00055943,
0.00053435]])
fig = go.Figure(go.Surface(x=X, y=Y, z=Z))
fig.show()
plt.figure(figsize=(6,6))
plt.contourf(X, Y, Z, levels=10, cmap="Reds")
plt.contour(X, Y, Z, levels=10, colors="k")
bw_method = 0.2
kde = gaussian_kde(df[["tot_minutes", "Age"]].T, bw_method=bw_method)
kde([180, 30])
array([0.00024693])
kde.integrate_box([180, 40], [240, 50])
0.20727237255581568
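integrate_box integrates the density over a rectangle, so the number above estimates the probability that a runner is 40–50 years old with a finish time between 180 and 240 minutes. As a sanity check of the method on simple synthetic data, the mass over a box containing essentially all of the density should be close to 1:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
pts = rng.normal(size=(2, 5000))   # 2-D standard normal sample
kde_check = gaussian_kde(pts)

# probability mass over a box that covers essentially the whole plane
total = kde_check.integrate_box([-10, -10], [10, 10])
print(round(total, 3))  # ≈ 1.0
```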
x = np.linspace(120, 360, 100)
y = np.linspace(10, 70, 100)
X, Y = np.meshgrid(x, y)
Z = kde(np.array([X.flatten(), Y.flatten()])).reshape(X.shape)
plt.figure(figsize=(10,6))
plt.plot(df["tot_minutes"], df["Age"], 'r.', alpha=0.1)
plt.contour(X, Y, Z, levels=10, colors='k', zorder=10);
import seaborn as sns
df = sns.load_dataset('tips')
df
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
From the course website:
from ipywidgets import interact, fixed
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
sns.set_context("notebook")
df = sns.load_dataset('tips')
def tip_plot(frac):
frac=frac/100
plt.figure(figsize=(12,7))
sns.scatterplot(data=df, x="total_bill", y="tip", marker='o')
x = np.arange(0, 55)
plt.plot(x, frac*x, c='b', label=f"{frac:.0%} tip")
plt.ylim(0, 11)
plt.title("Total bill vs tip amount")
plt.legend()
plt.show()
interact(tip_plot, frac=(10, 20));
Average tip fraction:
df['tip_fraction'] = df['tip']/df['total_bill']
df
| total_bill | tip | sex | smoker | day | time | size | tip_fraction | |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 0.203927 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0.073584 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 0.088222 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 0.098204 |
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 0.159744 |
244 rows × 8 columns
mean_tip_frac = df['tip_fraction'].mean()
mean_tip_frac
0.16080258172250478
df['naive_tip_prediction'] = mean_tip_frac*df['total_bill']
df['naive_prediction_error'] = df['naive_tip_prediction'] - df['tip']
df
| total_bill | tip | sex | smoker | day | time | size | tip_fraction | naive_tip_prediction | naive_prediction_error | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 | 2.732036 | 1.722036 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 | 1.662699 | 0.002699 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 | 3.378462 | -0.121538 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 | 3.807805 | 0.497805 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 | 3.954135 | 0.344135 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 0.203927 | 4.668099 | -1.251901 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0.073584 | 4.370614 | 2.370614 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 0.088222 | 3.645395 | 1.645395 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 0.098204 | 2.865502 | 1.115502 |
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 0.159744 | 3.019872 | 0.019872 |
244 rows × 10 columns
df['naive_prediction_error'].mean()
0.18335196705924947
np.abs(df['naive_prediction_error']).mean()
0.7968739185842583
np.abs(df['naive_prediction_error']).describe()
count    244.000000
mean       0.796874
std        0.826432
min        0.002699
25%        0.234597
50%        0.509510
75%        1.068028
max        4.623554
Name: naive_prediction_error, dtype: float64
np.abs(df['naive_prediction_error']/df['tip']).describe()
count    244.000000
mean       0.298342
std        0.383098
min        0.001626
25%        0.094422
50%        0.179638
75%        0.327921
max        3.512093
dtype: float64
Predicted tip amounts vs prediction errors:
plt.figure(figsize=(12, 7))
sns.scatterplot(x='naive_tip_prediction',
y='naive_prediction_error',
data=df)
plt.axhline(0, c='k')
plt.title("Predicted tip amount vs prediction error", y=1.01);
Our goal will be to try to improve the predictions by finding values of parameters $\alpha_1, \alpha_2$ such that the function $$f(\text{total bill}) = \alpha_1\cdot (\text{total bill}) + \alpha_2$$ gives as good predictions of tip amounts as possible.
We first need to specify what we mean by "as good as possible". Let $x^{(i)}$ denote the total bill and $y^{(i)}$ the tip amount of the $i$-th record in the data.
We want to find a function $$f(x) = \alpha_1x + \alpha_2$$ such that the value
$$C(\alpha_1, \alpha_2) = \sum_i \left(y^{(i)} - f(x^{(i)})\right)^2 = \sum_i \left(y^{(i)} - \alpha_1x^{(i)} - \alpha_2\right)^2$$
is as small as possible. Note that $x^{(i)}$ and $y^{(i)}$ are known (they come from the data), while the parameters $\alpha_1, \alpha_2$ are unknown. The function $C(\alpha_1, \alpha_2)$ is called the cost function.
Example.
Calculate the cost function for a few different values of $\alpha_1$ and $\alpha_2$:
choices = np.array([[0.16, 0], [0.20, 0], [0.14, 0.5]])
print("alpha_1 alpha_2 cost")
for i in range(len(choices)):
cost = ((df['total_bill']*choices[i, 0] + choices[i, 1] - df['tip'])**2).sum()
print(f"{choices[i, 0]:<9} {choices[i, 1]:<9} {cost:.2f}")
alpha_1   alpha_2   cost
0.16      0.0       317.84
0.2       0.0       650.87
0.14      0.5       294.37
Gradient descent is a method of finding a minimum of a function $f$.
Algorithm: choose a starting point $x_0$ and a learning rate $\lambda > 0$, then iterate $$x_{k+1} = x_k - \lambda\,\nabla f(x_k)$$ for a prescribed number of steps (or until the values stabilize).
import plotly.graph_objects as go
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def descent(Df, x0, l_rate=0.1, nsteps=1000):
'''
Performs gradient descent of a given function f.
Df:
Differential of f
x0:
The starting point.
l_rate:
The learning rate.
nsteps:
Number of iterations to run.
Returns:
A list of points computed during steps of the gradient descent.
'''
x = np.array(x0, dtype='float')
path = [x]
for i in range(nsteps):
Dfx = np.array(Df(x))
x = x - l_rate*Dfx
path.append(x)
return path
def plot_descent(f, xlim, ylim, path=None, levels=20):
'''
Creates a contour plot of a function together with the path
computed by gradient descent applied to the function.
f:
Function to be plotted
path:
List of coordinates of points computed by the
gradient descent algorithm.
xlim, ylim:
Tuples with limits of x- and y-values for the contour
plot of the function.
levels:
Specifies levels of the contour plot.
'''
plt.figure(figsize=(8, 8))
x, y = np.meshgrid(np.linspace(*xlim, 1000), np.linspace(*ylim, 1000))
Z = f(np.vstack([x.ravel(), y.ravel()])).reshape(x.shape)
plt.contourf(x, y, Z, levels=levels, cmap='bone')
plt.contour(x, y, Z, levels=levels, colors='gray')
if path is not None:
plt.plot([x[0] for x in path], [x[1] for x in path], 'ro-', ms=4)
plt.show()
def plot_descent_step(f, xlim, ylim, path=None, levels=20, last=None, step=1):
plot_descent(f=f,
xlim=xlim,
ylim=ylim,
path=path[:last:step],
levels=levels)
def plot3d(f, xlim, ylim):
x = np.linspace(xlim[0], xlim[1], 400)
y = np.linspace(ylim[0], ylim[1], 400)
X, Y = np.meshgrid(x, y)
Z = f(np.array([X, Y]))
fig = go.Figure(go.Surface(x=X, y=Y, z=Z, colorscale="picnic"))
fig.update_layout(autosize=False, width=800, height=600)
fig.show()
Example 1. Gradient descent for $p(x, y) = ax^2 + by^2$.
a = 2
b = 1
def p(x):
return a*x[0]**2 + b*x[1]**2
plot3d(p, (-5, 5), (-5, 5))
def Dp(x):
return (2*a*x[0], 2*b*x[1])
path = descent(Dp, x0=[2,4], l_rate=0.1, nsteps=50)
plot_descent(p,
xlim=(-5, 5),
ylim=(-5, 5),
path=path,
levels=40)
path[-1]
array([1.61656255e-11, 5.70899077e-05])
Note. This example illustrates why data features need to be normalized before gradient descent is used to compute a regression. In the function above, if the values of the parameters a and b are close to each other, gradient descent converges quickly. Convergence is much slower, though, if one of these parameters is much larger than the other:
a = 50
b = 1
plot3d(p, (-5, 5), (-5, 5))
a = 50
b = 1
path = descent(Dp, x0=[2,4], l_rate=0.015, nsteps=1000)
plot_descent(p,
xlim=(-5, 5),
ylim=(-5, 5),
path=path,
levels=40)
path[-1]
array([1.86652724e-301, 2.36479913e-013])
From the course website:
def h(x):
'''
Himmelblau's function
h(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2
'''
return (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2
def Dh(x):
return np.array([
2 * (x[0]**2 + x[1] - 11) * 2 * x[0] + 2 * (x[0] + x[1]**2 - 7),
2 * (x[0]**2 + x[1] - 11) + 2 * (x[0] + x[1]**2 - 7) * 2 * x[1]
])
def r(x):
'''
Rosenbrock function
r(x, y) = (1-x)^2 + 100(y-x^2)^2
'''
return (1-x[0])**2 + 100*(x[1]-x[0]**2)**2
def Dr(x):
return np.array([-2*(1-x[0]) - 400*(x[1]-x[0]**2)*x[0], 200*(x[1]-x[0]**2)])
Himmelblau function:
plot3d(h, (-5, 5), (-5, 5))
path = descent(Dh, x0=[-1,1], l_rate=0.01)
plot_descent(h,
xlim=(-5, 5),
ylim=(-5, 5),
path=path,
levels=np.exp(np.linspace(-7, 6.5, 40)) - 2)
Rosenbrock function:
plot3d(r, (-1.4, 1.4), (-0.5, 1.4))
path = descent(Dr, x0=[-0.5,1], nsteps=20000, l_rate=0.001)
plot_descent(r,
xlim=(-1.4, 1.4),
ylim=(-0.5, 1.4),
path=path,
levels=np.exp(np.linspace(-7, 6.5, 40)) - 2)
Recall:
We want to find a function $$f(x) = \alpha_1x + \alpha_2$$ such that the value
$$C(\alpha_1, \alpha_2) = \sum_i \left(y^{(i)} - f(x^{(i)})\right)^2 = \sum_i \left(y^{(i)} - \alpha_1x^{(i)} - \alpha_2\right)^2$$
is as small as possible.
Generalization:
Assume that we have a collection of data $\{(x^{(i)}, y^{(i)})\}$ where each $x^{(i)} = (x_0^{(i)}, \dots, x_n^{(i)})$ is a vector of features, with $x_0^{(i)} = 1$, and $y^{(i)}$ is the corresponding target value.
We want to find a function $f(x) = \sum_{j=0}^{n}\alpha_jx_j$ such that the value
$$C(\alpha_0, {\dots}, \alpha_n) = \sum_i \left(y^{(i)} - f(x^{(i)})\right)^2$$
is as small as possible.
To use the gradient descent we need to compute $\nabla C(\alpha_0, {\dots}, \alpha_n)$. Check:
$$\nabla C(\alpha_0, {\dots}, \alpha_n) = \sum_i -2\left(y^{(i)} - f(x^{(i)})\right)\cdot \left(x_0^{(i)}, {\dots}, x_n^{(i)}\right) = \sum_i 2 \left(f(x^{(i)}) - y^{(i)}\right)\cdot \left(x_0^{(i)}, \dots, x_n^{(i)}\right)$$
sklearn
from sklearn.linear_model import LinearRegression
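As an alternative to sklearn, the gradient formula above can be used directly to fit a line by gradient descent. A sketch on synthetic bill/tip-like data (the small learning rate and many iterations reflect the unnormalized feature scale discussed earlier):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(3, 50, 200)                     # synthetic "total bill"
y = 0.1 * x + 1.0 + rng.normal(0, 0.5, 200)     # synthetic "tip"

a = np.zeros(2)   # parameters (slope, intercept)
l_rate = 1e-3
for _ in range(100_000):
    resid = a[0] * x + a[1] - y                 # f(x^(i)) - y^(i)
    # mean gradient of the cost with respect to (slope, intercept)
    grad = 2 * np.array([(resid * x).mean(), resid.mean()])
    a -= l_rate * grad
print(a.round(2))  # close to the true slope 0.1 and intercept 1.0
```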
reg = LinearRegression()
df
| total_bill | tip | sex | smoker | day | time | size | tip_fraction | naive_tip_prediction | naive_prediction_error | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 | 2.732036 | 1.722036 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 | 1.662699 | 0.002699 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 | 3.378462 | -0.121538 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 | 3.807805 | 0.497805 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 | 3.954135 | 0.344135 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 0.203927 | 4.668099 | -1.251901 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0.073584 | 4.370614 | 2.370614 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 0.088222 | 3.645395 | 1.645395 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 0.098204 | 2.865502 | 1.115502 |
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 0.159744 | 3.019872 | 0.019872 |
244 rows × 10 columns
reg.fit(df[['total_bill']], df['tip'])
LinearRegression()
reg.coef_, reg.intercept_
(array([0.10502452]), 0.9202696135546731)
reg.predict(df[['total_bill']])[:10]
array([2.70463616, 2.00622312, 3.12683472, 3.40725019, 3.5028225 ,
3.57633966, 1.84133463, 3.74332864, 2.49983836, 2.47253198])
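LinearRegression solves the same least-squares problem in closed form, so on any dataset its coefficients agree with a degree-1 np.polyfit. A check on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x = rng.uniform(3, 50, 100)
y = 0.105 * x + 0.92 + rng.normal(0, 0.8, 100)   # synthetic bill/tip pairs

reg_check = LinearRegression().fit(x.reshape(-1, 1), y)
slope, intercept = np.polyfit(x, y, 1)           # closed-form least squares
print(np.isclose(reg_check.coef_[0], slope),
      np.isclose(reg_check.intercept_, intercept))  # True True
```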
df['regression'] = reg.predict(df[['total_bill']])
df
| total_bill | tip | sex | smoker | day | time | size | tip_fraction | naive_tip_prediction | naive_prediction_error | regression | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 | 2.732036 | 1.722036 | 2.704636 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 | 1.662699 | 0.002699 | 2.006223 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 | 3.378462 | -0.121538 | 3.126835 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 | 3.807805 | 0.497805 | 3.407250 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 | 3.954135 | 0.344135 | 3.502822 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 0.203927 | 4.668099 | -1.251901 | 3.969131 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0.073584 | 4.370614 | 2.370614 | 3.774836 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 0.088222 | 3.645395 | 1.645395 | 3.301175 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 0.098204 | 2.865502 | 1.115502 | 2.791807 |
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 0.159744 | 3.019872 | 0.019872 | 2.892630 |
244 rows × 11 columns
df['regression_error'] = df['regression'] - df['tip']
df
| total_bill | tip | sex | smoker | day | time | size | tip_fraction | naive_tip_prediction | naive_prediction_error | regression | regression_error | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 | 2.732036 | 1.722036 | 2.704636 | 1.694636 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 | 1.662699 | 0.002699 | 2.006223 | 0.346223 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 | 3.378462 | -0.121538 | 3.126835 | -0.373165 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 | 3.807805 | 0.497805 | 3.407250 | 0.097250 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 | 3.954135 | 0.344135 | 3.502822 | -0.107178 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 0.203927 | 4.668099 | -1.251901 | 3.969131 | -1.950869 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0.073584 | 4.370614 | 2.370614 | 3.774836 | 1.774836 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 0.088222 | 3.645395 | 1.645395 | 3.301175 | 1.301175 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 0.098204 | 2.865502 | 1.115502 | 2.791807 | 1.041807 |
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 0.159744 | 3.019872 | 0.019872 | 2.892630 | -0.107370 |
244 rows × 12 columns
df[['regression_error', 'naive_prediction_error']].describe()
| | regression_error | naive_prediction_error |
|---|---|---|
| count | 2.440000e+02 | 244.000000 |
| mean | -5.223508e-16 | 0.183352 |
| std | 1.019943e+00 | 1.134396 |
| min | -3.743435e+00 | -3.984181 |
| 25% | -4.863111e-01 | -0.467431 |
| 50% | 9.744499e-02 | 0.091666 |
| 75% | 5.651615e-01 | 0.566811 |
| max | 3.198225e+00 | 4.623554 |
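The mean of `regression_error` is zero up to floating-point noise (about -5.2e-16). This is not a coincidence: an ordinary least-squares fit that includes an intercept always produces residuals that sum to exactly zero. A minimal sketch with synthetic data (hypothetical numbers, numpy only):

```python
import numpy as np

# Synthetic stand-in for the tips data (hypothetical values).
rng = np.random.default_rng(0)
x = rng.uniform(5, 50, size=200)            # e.g. total bill amounts
y = 0.9 + 0.1 * x + rng.normal(0, 1, 200)   # e.g. tips, with noise

# Least-squares fit of y = a*x + b via the normal equations.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = (a * x + b) - y
print(residuals.mean())   # numerically zero, up to floating-point error
```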
np.abs(df[['regression_error', 'naive_prediction_error']]).describe()
| | regression_error | naive_prediction_error |
|---|---|---|
| count | 244.000000 | 244.000000 |
| mean | 0.745825 | 0.796874 |
| std | 0.694074 | 0.826432 |
| min | 0.002632 | 0.002699 |
| 25% | 0.276832 | 0.234597 |
| 50% | 0.541028 | 0.509510 |
| 75% | 0.999040 | 1.068028 |
| max | 3.743435 | 4.623554 |
(df[['regression_error', 'naive_prediction_error']]**2).sum()
regression_error          252.788744
naive_prediction_error    320.908310
dtype: float64
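The regression's smaller sum of squared errors is guaranteed by construction: least squares chooses the coefficients that minimize exactly this quantity, so no other straight line, including the 16% rule, can beat it on this metric. A quick sketch on synthetic data (hypothetical values) showing that perturbing the fitted coefficients can only increase the SSE:

```python
import numpy as np

# Synthetic stand-in for the tips data (hypothetical values).
rng = np.random.default_rng(1)
x = rng.uniform(5, 50, size=200)
y = 0.9 + 0.1 * x + rng.normal(0, 1, size=200)

# Least-squares fit of y = a*x + b.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

def sse(slope, intercept):
    """Sum of squared errors for the line y = slope*x + intercept."""
    return ((slope * x + intercept - y) ** 2).sum()

# The fitted line minimizes the SSE, so any perturbation makes it worse:
best = sse(a, b)
print(best < sse(a + 0.01, b), best < sse(a, b - 0.1))   # True True
```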
plt.figure(figsize=(12,7))
sns.scatterplot(data=df, x="total_bill", y="tip", marker='o')
x = np.arange(0, 55)
plt.plot(x, 0.16*x, c='b', label="16% tip")
plt.plot(x, reg.coef_*x + reg.intercept_, c='g', label="regression")
plt.ylim(0, 11)
plt.title("Total bill vs tip amount")
plt.legend()
plt.show()
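The green line in the plot is drawn with `reg.coef_*x + reg.intercept_`, which is exactly the affine map that `reg.predict` computes. A small check on toy data (hypothetical values) confirming the equivalence:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic stand-in for the tips data (hypothetical values).
X = np.array([[10.0], [20.0], [30.0], [40.0]])
y = np.array([1.5, 3.0, 4.0, 6.0])

reg = LinearRegression().fit(X, y)

# predict() is just the affine map X @ coef_ + intercept_:
manual = X @ reg.coef_ + reg.intercept_
print(np.allclose(reg.predict(X), manual))   # True
```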
total_bill and size

reg2 = LinearRegression()
reg2.fit(df[['total_bill', 'size']], df['tip'])
reg2.coef_, reg.intercept_
(array([0.09271334, 0.19259779]), 0.9202696135546731)
df['regression_2'] = reg2.predict(df[['total_bill', 'size']])
df
| | total_bill | tip | sex | smoker | day | time | size | tip_fraction | naive_tip_prediction | naive_prediction_error | regression | regression_error | regression_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 | 2.732036 | 1.722036 | 2.704636 | 1.694636 | 2.629340 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 | 1.662699 | 0.002699 | 2.006223 | 0.346223 | 2.205394 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 | 3.378462 | -0.121538 | 3.126835 | -0.373165 | 3.194645 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 | 3.807805 | 0.497805 | 3.407250 | 0.097250 | 3.249592 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 | 3.954135 | 0.344135 | 3.502822 | -0.107178 | 3.719157 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 0.203927 | 4.668099 | -1.251901 | 3.969131 | -1.950869 | 3.938206 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0.073584 | 4.370614 | 2.370614 | 3.774836 | 1.774836 | 3.574089 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 0.088222 | 3.645395 | 1.645395 | 3.301175 | 1.301175 | 3.155952 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 0.098204 | 2.865502 | 1.115502 | 2.791807 | 1.041807 | 2.706292 |
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 0.159744 | 3.019872 | 0.019872 | 2.892630 | -0.107370 | 2.795297 |
244 rows × 13 columns
df['regression_2 error'] = df['regression_2'] - df['tip']
np.abs(df[['regression_error', 'regression_2 error']]).describe()
| | regression_error | regression_2 error |
|---|---|---|
| count | 244.000000 | 244.000000 |
| mean | 0.745825 | 0.739004 |
| std | 0.694074 | 0.685833 |
| min | 0.002632 | 0.002048 |
| 25% | 0.276832 | 0.273011 |
| 50% | 0.541028 | 0.536010 |
| 75% | 0.999040 | 0.959264 |
| max | 3.743435 | 4.042497 |
Upshot: Adding the size data does not seem to meaningfully improve the fit of the model.
reg.score(df[['total_bill']], df['tip'])
0.45661658635167657
reg2.score(df[['total_bill', 'size']], df['tip'])
0.46786930879612587
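The `score` method returns the coefficient of determination R² = 1 - SS_res/SS_tot, i.e. the fraction of the variance of the tips that the model explains; the increase from 0.457 to 0.468 is small. A sketch verifying this formula on synthetic data (hypothetical values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the tips data (hypothetical values).
rng = np.random.default_rng(2)
X = rng.uniform(5, 50, size=(200, 1))
y = 0.9 + 0.1 * X[:, 0] + rng.normal(0, 1, 200)

reg = LinearRegression().fit(X, y)

# score() returns R^2 = 1 - SS_res / SS_tot:
ss_res = ((y - reg.predict(X)) ** 2).sum()      # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()            # total sum of squares
print(np.isclose(reg.score(X, y), 1 - ss_res / ss_tot))   # True
```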
total_bill and day:

Create dummy variables for the day column:
dummies = pd.get_dummies(df['day'])
dummies
| | Thur | Fri | Sat | Sun |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... |
| 239 | 0 | 0 | 1 | 0 |
| 240 | 0 | 0 | 1 | 0 |
| 241 | 0 | 0 | 1 | 0 |
| 242 | 0 | 0 | 1 | 0 |
| 243 | 1 | 0 | 0 | 0 |
244 rows × 4 columns
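One thing worth noting (a general point about dummy variables, not from the notes above): together with the model's intercept, the four dummy columns are linearly dependent, since they always sum to 1. Passing `drop_first=True` to `pd.get_dummies` drops one category, which then serves as the baseline absorbed by the intercept. A minimal sketch with a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame with the same four day categories.
days = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sun", "Sun"]})

# drop_first=True drops the first category (alphabetically "Fri" here);
# the remaining columns are no longer linearly dependent with an intercept.
dummies = pd.get_dummies(days["day"], drop_first=True)
print(dummies.columns.tolist())   # ['Sat', 'Sun', 'Thur']
```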
df = pd.concat([df, dummies], axis=1)
df
| | total_bill | tip | sex | smoker | day | time | size | tip_fraction | naive_tip_prediction | naive_prediction_error | regression | regression_error | regression_2 | regression_2 error | Thur | Fri | Sat | Sun |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 | 2.732036 | 1.722036 | 2.704636 | 1.694636 | 2.629340 | 1.619340 | 0 | 0 | 0 | 1 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 | 1.662699 | 0.002699 | 2.006223 | 0.346223 | 2.205394 | 0.545394 | 0 | 0 | 0 | 1 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 | 3.378462 | -0.121538 | 3.126835 | -0.373165 | 3.194645 | -0.305355 | 0 | 0 | 0 | 1 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 | 3.807805 | 0.497805 | 3.407250 | 0.097250 | 3.249592 | -0.060408 | 0 | 0 | 0 | 1 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 | 3.954135 | 0.344135 | 3.502822 | -0.107178 | 3.719157 | 0.109157 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 0.203927 | 4.668099 | -1.251901 | 3.969131 | -1.950869 | 3.938206 | -1.981794 | 0 | 0 | 1 | 0 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0.073584 | 4.370614 | 2.370614 | 3.774836 | 1.774836 | 3.574089 | 1.574089 | 0 | 0 | 1 | 0 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 0.088222 | 3.645395 | 1.645395 | 3.301175 | 1.301175 | 3.155952 | 1.155952 | 0 | 0 | 1 | 0 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 0.098204 | 2.865502 | 1.115502 | 2.791807 | 1.041807 | 2.706292 | 0.956292 | 0 | 0 | 1 | 0 |
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 0.159744 | 3.019872 | 0.019872 | 2.892630 | -0.107370 | 2.795297 | -0.204703 | 1 | 0 | 0 | 0 |
244 rows × 18 columns
reg3 = LinearRegression()
reg3.fit(df[['total_bill', 'Thur', 'Fri', 'Sat', 'Sun']], df['tip'])
reg3.coef_, reg.intercept_
(array([ 0.1046728 , -0.01132963, 0.00755391, -0.07843209, 0.08220781]), 0.9202696135546731)
df['regression_3'] = reg3.predict(df[['total_bill', 'Thur', 'Fri', 'Sat', 'Sun']])
df['regression_3 error'] = df['regression_3'] - df['tip']
np.abs(df[['regression_error', 'regression_3 error']]).describe()
| | regression_error | regression_3 error |
|---|---|---|
| count | 244.000000 | 244.000000 |
| mean | 0.745825 | 0.739488 |
| std | 0.694074 | 0.697760 |
| min | 0.002632 | 0.001820 |
| 25% | 0.276832 | 0.252676 |
| 50% | 0.541028 | 0.531382 |
| 75% | 0.999040 | 1.010272 |
| max | 3.743435 | 3.828128 |
reg.score(df[['total_bill']], df['tip'])
0.45661658635167657
reg3.score(df[['total_bill', 'Thur', 'Fri', 'Sat', 'Sun']], df['tip'])
0.45887402603123395
Upshot: Adding the day data does not improve the model either; the R² score barely changes.
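A caveat when reading these score comparisons (a general fact about least squares, not from the notes above): in-sample R² can never decrease when a feature is added, so tiny gains like those from size and day are expected even when the new features carry no real information. A sketch showing that even a column of pure noise does not lower the in-sample score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the tips data (hypothetical values).
rng = np.random.default_rng(3)
X = rng.uniform(5, 50, size=(200, 1))
y = 0.9 + 0.1 * X[:, 0] + rng.normal(0, 1, 200)

r2_base = LinearRegression().fit(X, y).score(X, y)

# Append a column of pure noise, unrelated to y:
X_noise = np.column_stack([X, rng.normal(size=200)])
r2_noise = LinearRegression().fit(X_noise, y).score(X_noise, y)

print(r2_noise >= r2_base)   # True: in-sample R^2 never decreases
```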